Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Mr. Ashwin Ramteke, Harsh Suryawanshi, Anuj Zanwar, Prathmesh Yadav
DOI Link: https://doi.org/10.22214/ijraset.2023.52752
Speech emotion recognition is an interesting but challenging human-computer interaction task that has attracted considerable attention in recent years. Many techniques have been used to extract emotion from speech signals, including a number of well-established speech analysis and classification methods. In the traditional approach, emotion-related features are first extracted from the speech signals, a subset of those features is then chosen by a selection module, and finally the emotions are recognized; this is a tedious and time-consuming process. This paper therefore provides an overview of a deep learning technique based on a simple feature-extraction algorithm and the construction of a model that recognizes emotions.
I. INTRODUCTION
A. Introduction
Speech emotion recognition (SER) is the task of recognizing the emotional aspects of speech independently of the semantic content. While humans perform this task effortlessly as a natural part of voice communication, performing it automatically with programmable devices is still a subject of research. Many users of human-machine communication stand to gain from it, including call-centre operators and customers, drivers, and pilots. Adding emotion to machines is recognized as a key factor in making them look and behave like humans. Robots that are able to understand emotions can display appropriate emotional responses and exhibit emotional personalities; in certain situations, computer-generated characters that appeal to human emotions can replace humans and hold very natural and compelling conversations. Machines therefore need to understand the emotions conveyed through speech; without this capability, truly meaningful human-machine interaction based on mutual trust and understanding is not possible.
Machine learning (ML) traditionally involves computing feature parameters from raw data (audio, images, videos, ECG, EEG, etc.). The features are used to train a model that learns to produce the desired output labels. A common problem with this approach is feature selection: in general, the characteristics that most efficiently group data into different categories (or classes) are unknown. Some insight can be gained by testing a large number of different features, combining them into a common feature vector, and applying different feature-selection techniques, but the quality of the resulting hand-crafted features can still have a significant impact on classification performance. The advent of deep neural network (DNN) classifiers has provided an elegant way to circumvent the problem of optimal feature selection. The idea is to use an end-to-end network that takes raw data as input and produces class labels as output. There is no need to compute hand-crafted features or determine optimal parameters from a classification standpoint; the network itself does everything. In other words, the network parameters (i.e. the weights and bias values assigned to the network nodes) are optimized during training and act as a function that efficiently splits the data into the desired categories. This very useful solution, however, requires a larger number of labelled data samples than traditional classification methods.
B. Scope and Objectives
In this article, we explore how speech data can be used in real-world applications such as automatic speech recognition (ASR) and speech emotion recognition (SER). We surveyed open-source Python packages that support ASR and outlined project ideas, and we consider building a robust SER model by training an LSTM on the Toronto Emotional Speech Set (TESS) dataset. This hands-on experience makes it possible to start building projects and master SER concepts. Researchers use a variety of speech-processing techniques to capture this hidden layer of information so that they can enhance and extract timbral and acoustic features from speech. Converting audio signals to a digital or vector format is not as easy as it is for images: the conversion method determines how much key information is retained when leaving the raw audio format. If a given data transformation does not capture qualities such as softness and serenity, it is difficult for the model to learn emotions and classify patterns.
One method for converting audio data into numbers is the mel spectrogram, which visualizes an audio signal based on its frequency content; represented as an image, it can be used to train a CNN as an image classifier. The same information can also be captured by Mel-Frequency Cepstral Coefficients (MFCCs). Each of these data formats has advantages and disadvantages depending on the application. In this work we take the MFCC representation and arrange the data in the format required by the model: an LSTM network takes the MFCC values as numeric input and tries to recognize the emotion.
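As an illustration of the two representations described above, the sketch below (assuming the librosa and matplotlib packages and a hypothetical TESS file name) computes a mel spectrogram and MFCCs for a single utterance; it is an example of the idea, not code taken from the original implementation.

```python
# Sketch: mel spectrogram (image-like, for CNNs) vs. MFCCs (numeric, for LSTMs).
# The file name is a hypothetical TESS recording.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("OAF_back_happy.wav", sr=None)  # hypothetical file path

# Mel spectrogram: frequency content over time on the mel scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: compact cepstral coefficients derived from the mel spectrum.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

fig, ax = plt.subplots(2, 1, figsize=(8, 6))
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax[0])
ax[0].set_title("Mel spectrogram (dB)")
librosa.display.specshow(mfcc, sr=sr, x_axis="time", ax=ax[1])
ax[1].set_title("MFCC")
plt.tight_layout()
plt.show()
```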
II. DESIGN FLOW/PROCESS
The following emotions can be classified: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
A. Traditional Techniques
Emotion recognition systems based on digitized speech comprise three fundamental components: signal pre-processing, feature extraction, and classification. Acoustic pre-processing such as denoising, together with segmentation, is carried out to determine meaningful units of the signal. Feature extraction is used to identify the relevant features available in the signal, and classifiers then map the extracted feature vectors to the relevant emotions. In this section, a detailed discussion of speech signal processing, feature extraction, and classification is provided; the differences between spontaneous and acted speech are also discussed because of their relevance to the topic. The figure below depicts a simplified system for speech-based emotion recognition. In the first stage of speech-based signal processing, speech enhancement is carried out and the noisy components are removed. The second stage involves two parts, feature extraction and feature selection: the required features are extracted from the pre-processed speech signal and a selection is made from among them, usually based on analysis of the speech signal in the time and frequency domains. During the third stage, classifiers such as GMMs and HMMs are used to classify these features and, based on this classification, the different emotions are recognized.
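The paper does not give code for this classical pipeline, so the following sketch illustrates one common realisation with scikit-learn: one Gaussian mixture model (GMM) is fitted per emotion and each test utterance is assigned to the emotion whose GMM yields the highest likelihood. The feature matrices X_train/X_test and label arrays y_train/y_test are hypothetical inputs from a separate feature-extraction step.

```python
# Simplified traditional pipeline: hand-crafted features classified by
# per-emotion GMMs (one of the classifier families named above).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmms(X_train, y_train, n_components=4):
    """Fit one GMM per emotion class on that class's training features."""
    gmms = {}
    for emotion in np.unique(y_train):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(X_train[y_train == emotion])
        gmms[emotion] = gmm
    return gmms

def predict(gmms, X_test):
    """Assign each sample to the emotion with the highest log-likelihood."""
    emotions = list(gmms.keys())
    scores = np.column_stack([gmms[e].score_samples(X_test) for e in emotions])
    return np.array(emotions)[scores.argmax(axis=1)]
```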
B. Database used
The dataset used in this work is TESS (Toronto Emotional Speech Set), in which two actresses (aged 26 and 64) uttered a set of 200 target words in the carrier phrase "Say the word _", and recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2800 data points (audio files) in total. The dataset is organized so that each actress and each of her emotions has its own folder containing the audio files for all 200 target words. The audio files are in WAV format.
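The sketch below shows one way such a folder structure could be enumerated and labelled in Python. The root folder name and the assumption that the emotion is the last underscore-separated token of each file name are illustrative; the exact layout may differ in a given copy of the dataset.

```python
# Sketch: index TESS audio files and derive emotion labels from file names,
# e.g. "OAF_back_angry.wav" -> "angry" (assumed naming convention).
import os

def load_tess_index(root_dir):
    """Return parallel lists of audio file paths and emotion labels."""
    paths, labels = [], []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(".wav"):
                emotion = name.rsplit(".", 1)[0].split("_")[-1].lower()
                paths.append(os.path.join(dirpath, name))
                labels.append(emotion)
    return paths, labels

paths, labels = load_tess_index("TESS")  # hypothetical root folder
print(len(paths), "files,", len(set(labels)), "emotion classes")
```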
C. Feature Extraction
For extracting acoustic features we use MFCCs. Mel-frequency cepstral coefficients (MFCCs) were originally designed to identify monosyllabic words in continuously spoken sentences, not to identify the speaker. The MFCC computation replicates the human auditory system, artificially implementing the working principle of the ear on the assumption that the human ear is a reliable speaker recognizer. MFCC features are rooted in the known variation of the critical bandwidths of the human ear: frequency filters spaced linearly at low frequencies and logarithmically at high frequencies are used to preserve the phonetically vital properties of the speech signal. Speech signals commonly contain tones of different frequencies; each tone has an actual frequency f (Hz), and its subjective pitch is measured on the mel scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. A 1 kHz tone, 40 dB above the threshold of perceptual audibility, is defined as 1000 mels and is used as a reference point.
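A hedged sketch of this extraction step with librosa is shown below; the choice of 40 coefficients, the 3-second duration, and the 0.5-second offset are illustrative assumptions rather than values fixed by the paper.

```python
# Sketch: compute 40 MFCCs per frame and average over time so that every
# utterance becomes a fixed-length 40-dimensional feature vector.
import librosa
import numpy as np

def extract_mfcc(path, n_mfcc=40):
    # duration/offset trim silence-heavy edges (assumed preprocessing choice)
    y, sr = librosa.load(path, duration=3, offset=0.5)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return np.mean(mfcc.T, axis=0)                          # (n_mfcc,)

# Example: build a feature matrix for the file list produced earlier.
# X = np.array([extract_mfcc(p) for p in paths])
```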
D. Creating Model
Long Short-Term Memory (LSTM) is an artificial neural network architecture used in artificial intelligence and deep learning. Unlike standard feedforward neural networks, LSTMs have feedback connections. These recurrent neural networks (RNNs) can process entire data sequences (e.g. speech or video) as well as individual data points (e.g. images). This property makes LSTM networks well suited to sequence processing and prediction; they have been applied to tasks such as unsegmented, connected handwriting recognition, speech recognition, machine translation, voice activity detection, robot control, video games, and healthcare. The name LSTM refers to the analogy that standard RNNs have both "long-term memory" and "short-term memory". The weights and biases of connections in the network change once per training episode, analogous to how physiological changes in synaptic strength store long-term memories, while the network's activation patterns change once per time step, analogous to how moment-to-moment changes in the brain's electrical activity store short-term memory. The LSTM architecture aims to provide the RNN with a short-term memory that can span thousands of time steps, hence "long short-term memory".
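A minimal Keras sketch of such an LSTM classifier for the 40-dimensional MFCC vectors is given below; the layer sizes, dropout rates, and optimizer are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch: LSTM classifier for MFCC feature vectors over the seven TESS emotions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    # Each utterance is fed as a sequence of 40 steps with one value per step,
    # i.e. input shape (40, 1), matching the 3-D (samples, timesteps, features)
    # layout expected by LSTM layers.
    LSTM(128, input_shape=(40, 1)),
    Dropout(0.2),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(7, activation="softmax"),  # seven TESS emotion classes
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.summary()
```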
III. LITERATURE SURVEY
TABLE I
SURVEY PAPERS FOR DIFFERENT MODELS AND TECHNIQUES IMPLEMENTED
S.No. | References | Emotion recognized | Database used | Deep Learning approach used | Contribution towards Emotion recognition and accuracy | Future Direction
1 | J. Zhao et al. (2019) [14] | Happiness, Sadness, Neutral, Surprise, Disgust, Fear, and Anger | Berlin EmoDB and IEMOCAP | CNN and DBN with four LFLBs and one LSTM | Deep 1D and 2D CNN-LSTM networks achieve 91.6% and 92.9% accuracy | The presented model can be extended to multimodal emotion recognition
2 | E. Lakomkin et al. (2018) [15] | Anger, Happiness, Neutral and Sadness | IEMOCAP database | RNN and CNN | Combined RNN-CNN for the iCub robot and in-domain data with 83.2% accuracy | Future work may lead to the use of generative models for real-time data input
3 | S. Sahu et al. (2018) [16] | Anger, Happiness, Neutral and Sadness | IEMOCAP database | Adversarial auto-encoders (AAE) | Accuracy with UAR of approximately 57.88% | Future work may lead to recognizing other emotions
4 | M. Chen et al. (2018) [17] | Anger, Sadness, Happy, Neutral, Fear, Disgust and Bored | IEMOCAP and Emo-DB databases | 3-D CNN-LSTM learning discriminative features | The model achieves an overall accuracy of 86.99% | Future work includes testing on different databases
5 | S. E. Eskimez et al. (2018) [18] | Anger, Frustration, Neutral, and Sadness | USC-IEMOCAP audio-visual dataset | CNN with VAE, AAE and AVB | Improved F1-score of 47% | Future work may include RNNs
6 | W. Zhang et al. (2017) [19] | Fear, Anger, Neutral, Joy, Surprise and Sadness | CAS emotional speech database | Feature fusion with SVM and a Deep Belief Network (DBN) for SER | DBN provides 94.6% accuracy compared to 84.54% for SVM | Future direction may lead to training the DBN with a combination of lexical and audio features
7 | P. Tzirakis et al. (2017) [20] | Anger, Happiness, Sadness, Neutral | Spontaneous emotional RECOLA and AVEC 2016 databases | CNN and a 50-layer ResNet for the audio and visual modalities, along with LSTM | End-to-end methodology used to recognize various emotions with 78.7% accuracy | Future work involves applying the proposed model to other databases
8 | H. M. Fayek et al. (2017) [21] | Anger, Happy, Neutral, Sad and Silence | IEMOCAP database | Feedforward network combined with RNN and CNN to recognize emotion from speech | Proposed SER technique relies on minimal speech processing and frame-based end-to-end deep learning, 64.78% accuracy | Future work may be to apply the same model to other databases
9 | Y. Zhang et al. (2017) [22] | Happiness, Neutral, Disgust, Sadness, Surprise, Fear, and Anger | ComParE and eGeMAPS feature sets | Multi-task DNN with shared hidden layers (MT-SHL-DNN) | Multi-task DNN used for the first time, with 93.6% accuracy | Can use multiple softmax and linear layers for classification and regression tasks
10 | Y. Zhang et al. (2017) [23] | Neutral, Anger, Fear, Disgust, Sadness, Joy and Happiness | IEMOCAP for emotion recognition and TIMIT for phoneme recognition | Recurrent Convolutional Neural Network (RCNN) | This hybrid model applied to SER for the first time, with 83.4% accuracy | Possibility of making the model more generic and efficient; cross-modal deep learning
11 | Z. Q. Wang and I. Tashev (2017) [24] | Happiness, Neutral, Disgust, Sadness, Surprise, Fear, and Anger | Mandarin dataset | DNN-based kernel extreme learning machine (ELM) | DNN-ELM approach provides improvements of 3.8% weighted accuracy and 2.94% unweighted accuracy in SER | Future work includes testing on other corpora
IV. SYSTEM REQUIREMENTS
A. Functional Requirements
B. Non-Functional Requirements
a. Software Requirements
b. Hardware Requirements
V. RESULTS
The project report presents the results of a speech emotion classification system using Mel-frequency cepstral coefficients (MFCCs) and Long Short-Term Memory (LSTM) networks. The trained model achieved an accuracy of 100% on the test dataset, demonstrating its effectiveness in accurately categorizing emotions from speech signals. The evaluation metrics, including precision, recall, and F1-score, consistently showed high values across different emotion classes. The combination of MFCCs and LSTM networks proved successful in capturing the nuanced variations and patterns associated with different emotional states. The study highlights the potential of this approach in areas such as affective computing and human-computer interaction. Further enhancements and optimizations can improve the system's accuracy and robustness, leading to a deeper understanding of human emotions in real-world scenarios.
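The evaluation metrics mentioned above could be computed as in the following sketch, which assumes a trained `model` and a one-hot-encoded held-out test split (X_test, y_test) produced in earlier steps.

```python
# Sketch: accuracy plus per-class precision, recall, and F1-score on the test set.
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test).argmax(axis=1)  # predicted class indices
y_true = y_test.argmax(axis=1)                 # true class indices from one-hot labels

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))   # per-class precision, recall, F1-score
```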
Tests were carried out to check the functioning of the following:
VI. ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this project would be incomplete without mentioning the people whose constant encouragement and guidance have been a source of inspiration throughout its course. We take this opportunity to express our sincere thanks to our guide, Mr. Ashwin Ramteke, for his support and encouragement throughout the completion of this project. We would also like to thank all the teaching and non-teaching faculty members and lab staff of the Department of Electronics and Communication Engineering for their encouragement, and we extend our thanks to all those who helped us directly or indirectly in the completion of this project.
VII. CONCLUSION
In this project we analysed speech samples using deep learning techniques. We first loaded the dataset and visualized the different human emotions with the waveshow and spectrogram functions of the Librosa library. We then extracted the acoustic features of all samples using the MFCC method and arranged the resulting sequential data into the 3-D array form accepted by the LSTM model. Finally, we built and trained the LSTM model and visualized the training results graphically using the matplotlib library; after repeated testing with different parameter values, the average accuracy of the model was found to be 71%.
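A minimal sketch of the training and visualization steps summarized above is given below; the train/test split ratio, epoch count, and batch size are illustrative assumptions, and X and y are assumed to be the MFCC feature matrix and one-hot emotion labels built earlier, with `model` the compiled LSTM from the earlier sketch.

```python
# Sketch: reshape features for the LSTM, train, and plot accuracy curves.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

X3d = np.expand_dims(X, axis=-1)  # (samples, 40) -> (samples, 40, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X3d, y, test_size=0.2, random_state=42)

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=50, batch_size=64)

# Visualize training and validation accuracy over epochs.
plt.plot(history.history["accuracy"], label="train accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```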
REFERENCES
[1] S. Cao et al. proposed a ranking SVM method for synthesizing information about emotion recognition to solve the problem of binary classification.
[2] Chen et al. aimed to improve speaker-independent speech emotion recognition with a three-level speech emotion recognition method.
[3] Nwe et al. proposed a new system for emotion classification of utterance signals, employing short-time log frequency power coefficients (LFPC) to characterize the speech signals and a discrete HMM as the classifier.
[4] T. Wu et al. proposed new modulation spectral features (MSFs) for human speech emotion recognition.
[5] Rong et al. presented an ensemble random forest to trees (ERFTrees) method with a high number of features for emotion recognition without reference to any language or linguistic information, which remains an open problem.
[6] Wu et al. proposed a fusion-based method for speech emotion recognition employing multiple classifiers, acoustic-prosodic (AP) features, and semantic labels (SLs).
[7] Narayanan proposed domain-specific emotion recognition using speech signals from a call-center application; detecting negative and non-negative emotions (e.g. anger and happiness) is the main focus of this research.
[8] Yang and Lugger presented a novel set of harmony features for speech emotion recognition, relying on psychoacoustic perception from music theory.
[9] Albornoz et al. investigated a new spectral feature in order to determine emotions and to characterize groups.
[10] Lee et al. presented a hierarchical computational structure to identify emotions; Lee et al. [11-12] proposed a hierarchical binary decision tree structure for emotion recognition.
[11] Yeh et al. [13] proposed a segment-based method for recognition of emotion in Mandarin speech.
[12] Dai et al. [14] proposed a computational approach for recognizing emotion and analyzing its characteristics in voiced social media such as WeChat.
[13] El Ayadi et al. [15] proposed a Gaussian mixture vector autoregressive (GMVAR) approach, a mixture of GMM with vector autoregression, for the classification problem of speech emotion recognition.
[14] B. W. Schuller, "Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends," Commun. ACM, vol. 61, no. 5, pp. 90–99, 2018.
[15] N. D. Lane and P. Georgiev, "Can deep learning revolutionize mobile sensing?" in Proc. ACM 16th Int. Workshop Mobile Comput. Syst. Appl., 2015, pp. 117–122.
[16] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," IEEE Access, vol. 7, pp. 19143–19165, 2019.
[17] K. R. Scherer, "What are emotions? And how can they be measured?" Social Sci. Inf., vol. 44, no. 4, pp. 695–729, 2005.
[18] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 32–80, Jan. 2001.
[19] R. W. Picard, "Affective computing," Perceptual Comput. Sect., Media Lab., MIT, Cambridge, MA, USA, Tech. Rep., 1995.
[20] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognit., vol. 44, no. 3, pp. 572–587, 2011.
[21] A. D. Dileep and C. C. Sekhar, "GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 8, pp. 1421–1432, Aug. 2014.
[22] A. Batliner, B. Schuller, D. Seppi, S. Steidl, L. Devillers, L. Vidrascu, T. Vogt, V. Aharonson, and N. Amir, "The automatic recognition of emotions in speech," in Emotion-Oriented Systems. Springer, 2011, pp. 71–99.
[23] E. Mower, M. J. Mataric, and S. Narayanan, "A framework for automatic human emotion classification using emotion profiles," IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 5, pp. 1057–1070, Jul. 2011.
[24] J. Han, Z. Zhang, F. Ringeval, and B. Schuller, "Prediction-based learning for continuous emotion recognition in speech," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 5005–5009.
Copyright © 2023 Mr. Ashwin Ramteke, Harsh Suryawanshi, Anuj Zanwar, Prathmesh Yadav. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET52752
Publish Date : 2023-05-22
ISSN : 2321-9653
Publisher Name : IJRASET